RDD, Dataframe or Datasets? When & Why?

The three basic APIs in Spark

Wed, 13 Jan 2021

RDD, Dataframes & Datasets in Spark

Spark is every data engineer’s go-to processing engine. What makes Spark so sought after is its ability to handle everything from batch and stream processing to machine learning workloads. On top of that, the APIs Spark provides for handling large-scale distributed data are intuitive, easy to use and highly efficient.

Spark provides three major APIs:

  • RDD
  • Dataframe
  • Datasets

TL;DR

Go for RDDs when:

  • You need low-level transformations and actions, and total control over your dataset
  • Your data is unstructured or semi-structured
  • The built-in transformations do not serve your purpose and you need to write your own functional transformations
  • A SQL-like view is not needed, or your processing does not really depend on one
  • Optimization and performance are not a major concern, or you can take care of them on your own

Go for Dataframes/Datasets when:

  • Your data is structured
  • The built-in transformations such as filters, maps and aggregations are enough for your tasks (you can always fall back on lambda functions for anything that is not available; see the sketch after this list)
  • You need compile-time type safety; in that case Datasets are the only choice
  • You want to take full advantage of Spark’s Catalyst optimizer and Tungsten’s efficient code generation
  • You are working in Python or R (dynamically typed languages), which can only use Dataframes
  • You are working in Java or Scala (statically typed languages), which can use both Dataframes and Datasets
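
To make the contrast concrete, here is a minimal sketch in Scala, assuming a local SparkSession and a made-up Person(name, age) dataset, of the same filter expressed against all three APIs:

```scala
import org.apache.spark.sql.SparkSession

object ApiComparison {
  // Hypothetical record type used only for this example
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("api-comparison").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 34), Person("Bob", 19), Person("Cara", 45))

    // RDD: purely functional transformations, Spark knows nothing about the schema
    val adultsRdd = spark.sparkContext.parallelize(people).filter(_.age >= 21)

    // Dataframe: schema is known, so Catalyst can optimize the filter
    val adultsDf = people.toDF().filter($"age" >= 21)

    // Dataset: schema is known *and* the filter is checked at compile time
    val adultsDs = people.toDS().filter(_.age >= 21)

    adultsRdd.collect().foreach(println)
    adultsDf.show()
    adultsDs.show()
    spark.stop()
  }
}
```

The three versions do the same work; what changes is how much Spark knows about the data and how early mistakes are caught.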

RDD

RDD (Resilient Distributed Dataset) is the most basic level of abstraction provided by Spark. Since Spark typically reads its data from a distributed store such as HDFS, the first step in any Spark application is to divide the data into partitions spread across the nodes of the cluster. That distributed collection of partitions, taken as a whole, is what Spark calls an RDD.

To keep it simple: Hadoop is all about clusters of commodity hardware, so when we store data on HDFS, every machine in the cluster holds a slice of the dataset on its own disk; when Spark loads that data, those slices become the partitions of a single logical RDD.

RDD was the first data structure provided by Spark. While it offers the ability to store and process huge amounts of data in a distributed manner, the burden of optimization falls entirely on the developer. This is because, with RDDs, Spark has no idea what the data is: its data types, its schema, or how it could be structured and stored in a way that makes further processing cost-effective and optimized. Distributed data processing requires shuffling and sorting depending on the type of operation, so with RDDs it is the developer’s duty to store the data in a way that makes these operations easy and efficient.
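
As a small illustration of what that looks like in practice, here is a sketch of a classic word count written directly against the RDD API, assuming a local SparkSession and a hypothetical input path; choosing the key, the data layout and the point where the shuffle happens is entirely on the developer:

```scala
import org.apache.spark.sql.SparkSession

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-wordcount").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path; to Spark these are just opaque lines of text
    val lines = sc.textFile("hdfs:///data/books/*.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))           // split each line into words
      .map(word => (word.toLowerCase, 1)) // the developer picks the key
      .reduceByKey(_ + _)                 // the shuffle happens here; Spark sees only opaque key/value pairs

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```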

The next abstraction provided by Spark is the Dataframe, which addresses the issue of query optimization very efficiently and takes that burden off the developer’s shoulders, providing an easy-to-use, SQL-like schema that can be queried like a normal relational database table.

Dataframes

Spark is all about taking your very large, won’t-fit-on-a-single-hard-disk dataset, dividing it up and saving it onto the disks of multiple computers in your cluster. Whenever you need to run a job, the task is sent to every computer, each one runs the task on the partition of the dataset it holds, and Spark combines the partial results and hands them back to you as one whole. That is what happens in a very basic MapReduce-style job. What looks like such a simple job on the surface is a whole different story underneath: how the data is stored initially largely affects the processing, both in computing cost and in time.

In the case of RDDs, Spark has no idea what your data is. Nothing is known about its types or schema, so optimization is out of the question: the most Spark can do is take your dataset and the number of partitions available and divide the data among them. With Dataframes, however, Spark knows the data types and the schema, so storage can be organized in a way that later improves the application’s performance when operations involving shuffling, grouping or aggregation are run on such smartly distributed data.

This is the benefit Dataframes provide over RDDs: you tell Spark what your data looks like and it takes care of the rest, giving you back a SQL-style table that you can run conventional SQL queries on. A developer still has to watch out for factors like partitioning, data skew and shuffling while processing data, but performance optimization is no longer solely the developer’s job: the generic part is handled by Spark, and you handle the parts specific to your task.
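
For example, here is a minimal sketch, assuming a hypothetical JSON file of order records with country and amount fields, of letting Spark infer the schema and then querying the Dataframe like an ordinary relational table, with the shuffle and aggregation plan left to the Catalyst optimizer:

```scala
import org.apache.spark.sql.SparkSession

object DataframeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-example").master("local[*]").getOrCreate()

    // Hypothetical input; Spark infers column names and types from the JSON
    val orders = spark.read.json("hdfs:///data/orders/*.json")
    orders.printSchema()

    // Register the Dataframe as a SQL view and query it like a table
    orders.createOrReplaceTempView("orders")
    val revenuePerCountry = spark.sql(
      """SELECT country, SUM(amount) AS revenue
        |FROM orders
        |GROUP BY country
        |ORDER BY revenue DESC""".stripMargin)

    revenuePerCountry.show()
    spark.stop()
  }
}
```

Written by hand against an RDD, the same query would force you to pick the key, the combiner and the partitioning yourself.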

Datasets

Just as a SQL table is a collection of rows (one row being one record), a Dataframe in Spark is really a Dataset of generic rows; in Scala, the DataFrame type is simply an alias for Dataset[Row]. So why do we need Datasets as a separate abstraction or API? The reason is type safety. Type safety is the software engineering term for making sure you store the right type of data in the right data structure. If you try to store an integer in a string variable, you are violating type safety and you get an error. No matter what programming language you use, you will get an error when you break a type safety rule; what differs between languages is the step at which they raise it.

We differentiate programming languages as statically typed or dynamically typed based on when they notice a type safety violation. Statically typed languages such as Java and Scala decide data types at compile time, so the error is raised at compile time too. Dynamically typed languages decide the data type of a variable only at runtime, so the error is raised only at runtime. The flexibility of not committing to data types is charming, but getting a type error after your code has been running for two hours, and that too on a distributed system, is a real pain.

Hence the third API provided by Spark: Datasets. Datasets come in two flavours, typed and untyped. Untyped Datasets are what PySpark works with, since Python is a dynamically typed language, and that untyped collection of rows is what you use as a Dataframe.

Typed Datasets are provided in Scala and Java and are used in cases where data types have to be strictly taken care of.
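
Here is a minimal sketch in Scala, assuming a hypothetical Order case class, of how the typed Dataset API catches mistakes at compile time that the untyped Dataframe API would only surface at runtime:

```scala
import org.apache.spark.sql.SparkSession

object TypedDatasetExample {
  // Hypothetical record type; the Dataset is checked against it at compile time
  case class Order(id: Long, country: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ds-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val orders: org.apache.spark.sql.Dataset[Order] =
      Seq(Order(1L, "PK", 120.0), Order(2L, "DE", 80.5)).toDS()

    // Typed API: a typo like _.amout, or comparing amount to a String,
    // would be rejected by the compiler, not two hours into the job.
    val bigOrders = orders.filter(_.amount > 100.0)

    // Untyped (Dataframe) API on the same data: the misspelled column below
    // would only blow up at runtime with an AnalysisException.
    // orders.toDF().filter($"amout" > 100.0)

    bigOrders.show()
    spark.stop()
  }
}
```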

Takeaways

None of these data structures in Spark is deprecated; each one has its own pros and cons and its own use cases. Which one to choose depends entirely on the data at hand and the task we want to perform on it. Nevertheless, the flexibility to choose and the ease of use Spark provides across all three are remarkable.

Ramsha Bukhari